K-Means Clustering - A Sparse Version of Principal Component Analysis
Introduction
The K-Means algorithm is one of the most popular clustering techniques; it groups similar data points into clusters to reveal hidden structure. For example, by analyzing sales data, customers can be grouped into segments to better understand their behaviour. In an earlier post, we explored Principal Component Analysis, a dimension-reduction tool that transforms a number of (possibly) correlated columns into a (smaller) number of uncorrelated columns called principal components.
While principal components can help rank the records of a dataset, K-Means simply groups the records into, for example, good/bad. Principal components are essentially the continuous solutions to the discrete cluster-membership indicators of K-Means clustering [1]. In other words, K-Means is a sparse version of PCA. We explore this idea with the European employment dataset by grouping similar countries. The principal components are then combined with the results of K-Means to see the similarities between the two techniques.
Required Packages
The packages listed below in the code chunk are required to replicate this project and generate this HTML document.
library(tidyverse) # The gas for the caR
library(cluster) # Cluster Analysis
library(factoextra) # Visualizing Clusters
library(plotly) # Interactive Plots
library(gridExtra) # Plot graphs in grids
library(DT) # Print HTML Tables
library(knitr) # Generating HTML document
library(rmdformats) # Document theme
Data
The dataset contains the distribution of employment across different sectors in 26 European countries. There are ten variables: the first is the name of the country, and the remaining nine hold the percentage of each country's citizens employed in nine different industries.
data <- read.table("https://raw.githubusercontent.com/mr-hn/kmeans-pca/master/european_jobs.txt",
header = TRUE, sep = "\t")
data_definition <- data.frame(variable = colnames(data),
description = c("Name of the country",
"Percentage employed in agriculture",
"Percentage employed in mining",
"Percentage employed in manufacturing",
"Percentage employed in power supply industries",
"Percentage employed in construction",
"Percentage employed in service industries",
"Percentage employed in finance",
"Percentage employed in social and personal services",
"Percentage employed in transport and communications"))
data_definition %>% datatable(caption = "Variable Descriptions")
The dataset contains the USSR and East Germany, so it presumably dates from the 1980s or earlier. The data is read in, and a box plot of the distribution of employment in each industry is presented below.
gather(data ,"industry","percentage") %>%
plot_ly(x = ~industry, y = ~percentage, type = "box",
hoverinfo = "text",
text = ~paste0(rownames(data), " - ", percentage,"%"),
line = list(color = "rgb(159, 33, 66)"),
marker = list(color = "rgb(33, 17, 3)")) %>%
layout(title = "Distribution of employments in different sectors - Europe, 1980\'s",
xaxis = list(title = "Industries"),
yaxis = list(title = "")) %>%
config(displayModeBar = FALSE)
K-Means Basics
K-Means clustering is one of the most commonly used unsupervised machine learning algorithms; it groups data into a pre-specified number of clusters, K. Data points are assigned to clusters in such a way that the points within a cluster are as similar to each other as possible. Each cluster is represented by a centroid, a vector of the mean values of all columns over the points in that cluster. Within-cluster similarity is measured through the sum of squared deviations between the centroid and the data points assigned to it. For each cluster \(C_k\),
\[ W(C_k) = \sum_{x_i \in C_k} \sum_{j} (x_{ij} - \mu_j)^2 \]
where
- \(x_{ij}\) is the element belonging to row i and column j
- \(\mu_j\) is the average of column j over all rows i in that cluster
For reference, this excellent post by Brad Boehmke goes into a lot more detail.
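As a sanity check, the within-cluster sum of squares can be computed by hand and compared against the `tot.withinss` value that `kmeans()` reports. A minimal sketch, using the scaled dataset:

```r
# Fit K-Means with 3 clusters on the scaled data
km <- kmeans(scale(data), centers = 3, nstart = 25)

# Recompute the total within-cluster sum of squares manually:
# for each cluster, sum the squared deviations of its points
# from the cluster centroid, then add the clusters up
wss_manual <- sum(sapply(1:3, function(k) {
  points <- scale(data)[km$cluster == k, , drop = FALSE]
  centroid <- km$centers[k, ]
  sum(sweep(points, 2, centroid)^2)
}))

all.equal(wss_manual, km$tot.withinss) # TRUE
```

`tot.withinss` is exactly the sum of the per-cluster quantities above, so the two values agree.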
The K-Means algorithm is applied to the dataset with the number of clusters varying from 2 to 7. The countries are colored based on the cluster they are assigned to and presented below on the choropleth. The number of clusters can be changed from the dropdown. For two clusters, the algorithm pits Turkey against the rest of Europe. With seven clusters, we see the countries separated broadly into central Europe, the Mediterranean, the East, and Scandinavia; Turkey and Hungary are in clusters of their own.
# K-Means requires the data to be scaled to mean 0 and unit variance
# so that Euclidean distance measurements are comparable across columns.
data_scaled <- scale(data) %>% as.data.frame()
# Country code is combined to plot on map
country_code <- read_csv("country_code.csv") %>%
rename(country = "COUNTRY", code = "CODE")
data_kmeans <- data_scaled %>% rownames_to_column("country") %>%
left_join(country_code, by = "country")
# Combining the results of K-Means algorithm
data_kmeans$k <- 0
data_kmeans$cluster <- 0
# Into geo_data
geo_data <- NULL
for (i in 2:7) {
kmeans_model <- kmeans(data_scaled, i, nstart = 25)
data_kmeans$cluster <- kmeans_model$cluster
data_kmeans$k <- i
geo_data <- geo_data %>% bind_rows(data_kmeans)
}
# Drop unnecessary columns
geo_data <- geo_data %>% select(country, code, k, cluster)
# Change data structure for Plotly
geo_data_played <- geo_data %>% spread(k, cluster, sep = "value")
geo_colnames <- colnames(geo_data_played)
# Plotly choropleth settings
l <- list(color = toRGB("grey"), width = 0.5)
g <- list(showframe = FALSE,
lonaxis = list(range = c(-10, 50)),
lataxis = list(range = c(30, 80)),
showcoastlines = FALSE,
projection = list(type = 'Mercator'))
# Build plot with data for two clusters visible
geo_plot <- geo_data_played %>% plot_geo() %>%
add_trace(z = ~kvalue2, color = ~kvalue2,
text = ~country, locations = ~code,
marker = list(line = l), showscale = FALSE) %>%
colorbar(title = "Scale") %>%
layout(geo = g,
title = "Countries grouped based on employment",
annotations = list(x = 0, y = 1, text = "Change clusters in dropdown",
showarrow = FALSE, xref = "paper", yref = "paper"))
# Add plots for clusters 3:7 invisible
for (col in geo_colnames[4:8]) {
geo_plot <- geo_plot %>%
add_trace(z = geo_data_played[[col]], color = geo_data_played[[col]],
text = ~country, locations = ~code,
marker = list(line = l), showscale = FALSE, visible = FALSE)
}
# Add dropdown for clusters and set corresponding visibility
geo_plot %>%
layout(
updatemenus = list(
list(y = 0.9, x = 0.08,
buttons = list(
list(method = "restyle",
args = list("visible", list(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)),
label = "2"),
list(method = "restyle",
args = list("visible",list(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)),
label = "3"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)),
label = "4"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)),
label = "5"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)),
label = "6"),
list(method = "restyle",
args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)),
label = "7")
)))) %>%
config(displayModeBar = FALSE)
We clearly see patterns emerging that we did not know existed before. Countries that are geographically and culturally closer have similar employment patterns. Unsupervised learning thus helped us draw inferences without a labelled response in the input dataset.
Determining the optimal number of clusters
Because the data was on a map, it was easy to see that seven clusters grouped the data better than two. But in cases where the output is not as easily interpretable, it becomes essential to choose the number of clusters more rigorously. There are three major methods to determine K.
- Elbow - The within-cluster residual sum of squares is plotted against the number of clusters, and the point where the curve shows a sharp drop (the "elbow") is considered optimal.
- Average Silhouette - The silhouette value measures how similar a data point is to its own cluster compared to other clusters. The optimal number of clusters is the one that maximizes the average silhouette.
- Gap Statistic - This method computes the within-cluster variation and compares it against that of clusters fit to a randomly generated reference dataset. The K that gives the largest gap is considered optimal.
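The elbow curve that `fviz_nbclust()` draws below can also be computed directly, which makes the underlying quantity explicit. A minimal sketch, reusing the scaled dataset:

```r
# Total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(data_scaled, centers = k, nstart = 25)$tot.withinss
})

# Plot K against the total WSS and look for the "elbow"
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

Adding clusters always lowers the total WSS; the elbow heuristic looks for the K after which the improvement becomes marginal.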
plot1 <- fviz_nbclust(data_scaled, kmeans, method = "wss") +
ggtitle("By Elbow Method") + xlab("") + theme(text = element_text(size = 8))
plot2 <- fviz_nbclust(data_scaled, kmeans, method = "silhouette") +
ggtitle("By Silhouette Method") + xlab("") + theme(text = element_text(size = 8))
set.seed(1804)
gap_stat <- clusGap(data_scaled, FUN = kmeans, nstart = 25, K.max = 12, B = 50)
plot3 <- fviz_gap_stat(gap_stat) +
ggtitle("By Gap Statistic") + xlab("") + theme(text = element_text(size = 8))
grid.arrange(plot1, plot2, plot3, nrow = 2, ncol = 2)
The result of the Elbow method is ambiguous, while three clusters give the highest average silhouette. The Gap Statistic method suggests one cluster as the optimum, which is impractical.
Going by the Silhouette method, three clusters broadly separate the data into western and eastern Europe, leaving Turkey and Yugoslavia in the third cluster. This seems fair, and the rest of the document will work with three clusters.
data$cluster <- kmeans(data_scaled, 3, nstart = 25)$cluster %>% as.numeric()
data %>% rownames_to_column("country") %>%
ggplot(aes(x = reorder(country, cluster), color = as.factor(cluster))) +
geom_text(aes(y = 1),label = rownames(data)) +
coord_flip() + theme_minimal() +
theme(legend.position = "none", axis.ticks = element_blank(), axis.text = element_blank(),
axis.title = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Merging Results with Principal Components
A PCA model is applied to the data and the results are printed below. The first three components explain ~75% of the variance. We will use only these three components, which also makes plotting the results feasible.
pca_model <- princomp(data[1:9], cor = TRUE) %>% summary(loadings = TRUE)
eigen_values <- pca_model$sdev^2
pca_model$sdev %>% as.data.frame() %>% rownames_to_column("component") %>%
bind_cols(eigen_value = round(eigen_values, 3),
proportion = round(eigen_values/sum(eigen_values), 3),
cumulative = round(cumsum(eigen_values)/sum(eigen_values), 3)) %>%
select(component, eigen_value, proportion, cumulative) %>%
datatable(caption = "Eigen Values of Principal Component Analysis")
The data points are plotted in the 3-D space of the first three principal components, and the cluster assignments from K-Means are rendered as colors. The output makes it apparent how closely PCA and K-Means are related: while PCA assigns each country to a point in three-dimensional space, K-Means only assigns each country to one of the K clusters.
pca_model$scores %>% as.data.frame() %>% rownames_to_column("country") %>%
bind_cols(data) %>% select(Comp.1, Comp.2, Comp.3, cluster, country) %>%
rename("component_1" = Comp.1, "component_2" = Comp.2, "component_3" = Comp.3) %>%
plot_ly(x = ~component_1, y = ~component_2, z = ~component_3, color = ~cluster,
hoverinfo = "text", mode = "markers",
text = ~paste(country, "\n",
"Component 1 =", round(component_1, 3),"\n",
"Component 2 =", round(component_2, 3),"\n",
"Component 3 =", round(component_3, 3),"\n")) %>%
add_markers() %>% hide_colorbar() %>%
layout(scene = list(xaxis = list(title = 'Component 1'),
yaxis = list(title = 'Component 2'),
zaxis = list(title = 'Component 3'),
camera = list(eye = list(x = 1, y = -1, z = 1),
up = list(x = 0, y = 0, z = 1))),
title = "Data plotted against the three principal components")
Matrix Algebra and Summary
The K-Means algorithm is essentially an optimization problem: find centroids such that, after cluster assignment, the residual sum of squares is minimized. Mathematically, this can be defined as
\[ \min_{C,\,W} \lVert X - CW \rVert^2 \]
where
- \(C\) denotes the matrix of centroids, of dimension \(1 \times k\), where k is the number of clusters.
- \(W\) is the assignment matrix, with \(k\) rows and \(n\) columns, where n is the number of data points.
- \(X\) is the original vector. The double vertical bars in the equation denote the Euclidean norm, the root of the sum of squares of the elements of the matrix.
The only constraint is that each data point can belong to only one cluster; that is, each column of \(W\) contains a single 1 and zeros elsewhere.
In other words, K-Means and PCA optimize the same objective function, the only difference being that K-Means has the additional "categorical" constraint on \(W\).
Relaxing that constraint, so that \(W\) may take arbitrary continuous values, turns the problem into PCA. In this sense, PCA is a relaxed K-Means (credit: this Quora answer).
https://www.quora.com/What-is-the-relationship-between-K-means-and-PCA
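To make the constraint concrete, here is a toy sketch with hypothetical 1-D data, showing the assignment matrix \(W\) and the reconstruction \(CW\):

```r
# Five 1-D data points and k = 2 clusters
X <- matrix(c(1, 2, 1.5, 10, 11), nrow = 1)

# Assignment matrix W: one column per point, a single 1 per column
# (points 1-3 -> cluster 1, points 4-5 -> cluster 2)
W <- matrix(c(1, 0,
              1, 0,
              1, 0,
              0, 1,
              0, 1), nrow = 2)

# Centroids: mean of the points assigned to each cluster (1 x k)
C <- matrix(c(mean(X[1:3]), mean(X[4:5])), nrow = 1)

# The reconstruction CW replaces each point by its centroid
C %*% W               # 1.5 1.5 1.5 10.5 10.5
sum((X - C %*% W)^2)  # the K-Means objective for this assignment
```

PCA solves the same reconstruction problem but lets the columns of \(W\) mix continuously instead of snapping each point to a single centroid.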
x <- data[,1:9] %>% as.matrix()
c <- pca_model$scores %>% as.data.frame() %>% as.matrix()
w <- pca_model$loadings[1:9,1:9] %>% as.data.frame() %>% as.matrix()
# X times w is c
scale(x) %*% (w) - c[1:26,1:9]
## (output: a 26 x 9 matrix of near-zero differences, one row per country
## and one column per component -- the scaled data multiplied by the
## loadings reproduces the principal component scores, with small
## discrepancies from princomp's use of divisor n rather than n - 1)
Homework
kmeans_model <- kmeans(data_scaled, 3, nstart = 25)
fviz_cluster(kmeans_model, data = data_scaled) + theme_minimal() +
  ggtitle("Clusters on the first two principal components")
data$cluster <- kmeans_model$cluster
data
pca_model$scores
pca_model$loadings
histo_viz <- list()
for (i in colnames(data_scaled)) {
  histo_viz[[i]] <- data %>% group_by(cluster) %>%
    summarize(mean = mean(.data[[i]])) %>%
    ggplot(aes(cluster, mean)) + geom_col() + coord_flip() +
    ylab(paste("Average", i)) + xlab("Cluster")
}
grid.arrange(grobs = histo_viz, nrow = 3, ncol = 3,
             top = "Average of each variable by cluster")